Incremental Ontology-Based Extraction and Alignment in Semi-structured Documents

Authors

  • Mouhamadou Thiam
  • Nacéra Bennacer
  • Nathalie Pernelle
  • Moussa Lo
Abstract

SHIRI is an ontology-based system for integrating semi-structured documents related to a specific domain. Its purpose is to give users access to the relevant parts of documents as answers to their queries. SHIRI uses RDF/OWL to represent resources and SPARQL to query them. It relies on an automatic, unsupervised, ontology-driven approach to the extraction, alignment and semantic annotation of tagged document elements. In this paper, we focus on the Extract-Align algorithm, which uses a set of named-entity and term patterns to extract candidate terms to be aligned with the ontology. The algorithm proceeds incrementally in order to populate the ontology with terms describing instances of the domain and to reduce access to external resources such as the Web. We evaluate it on an HTML corpus of calls for papers in computer science, and the results are very promising. They show that the incremental behaviour of Extract-Align enriches the ontology and that the number of terms (or named entities) aligned directly with the ontology increases.
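
The incremental behaviour described in the abstract can be illustrated with a minimal sketch. This is not the authors' Extract-Align implementation: the toy ontology, the delimiter-based term pattern, the `external_lookup` callback (a stand-in for an external resource such as the Web) and all function names are assumptions made for illustration only.

```python
import re

def extract_candidates(text):
    """Toy term pattern (assumption): split an element's text on , ; :"""
    return [t.strip().lower() for t in re.split(r"[,;:]", text) if t.strip()]

def align(term, ontology):
    """Return the concept whose known terms contain `term`, or None."""
    for concept, terms in ontology.items():
        if term in terms:
            return concept
    return None

def extract_align(elements, ontology, external_lookup):
    """Align candidate terms, enriching the ontology as a side effect.

    A term found through the external resource is added to the ontology,
    so later occurrences are aligned directly without external access.
    """
    stats = {"direct": 0, "external": 0, "unaligned": 0}
    for text in elements:
        for term in extract_candidates(text):
            if align(term, ontology) is not None:
                stats["direct"] += 1
                continue
            concept = external_lookup(term)  # hypothetical external resource
            if concept is not None:
                ontology.setdefault(concept, set()).add(term)  # enrichment
                stats["external"] += 1
            else:
                stats["unaligned"] += 1
    return stats

# Hypothetical data: concept -> known terms, plus a fake external resource.
ontology = {"Topic": {"semantic web"}, "Event": {"workshop"}}
external = {"ontology alignment": "Topic"}
elements = ["semantic web, ontology alignment", "ontology alignment; workshop"]

stats = extract_align(elements, ontology, external.get)
print(stats)
```

In this sketch, the second occurrence of "ontology alignment" is aligned directly because the first occurrence already enriched the ontology, which is the effect the abstract reports as a growing number of directly aligned terms.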

Related resources

Annotation Semantique de Documents Semi-Structurés pour la recherche d'information. (Semantic Annotation of Semi-structured Documents for Information Retrieval)

The semantic web is defined by a set of methods and technologies enabling software agents to reason about the contents of Web resources. This vision of the Web depends on the construction of ontologies and the use of metadata to represent these resources. The objective of our thesis is to semantically annotate tagged documents related to a domain of interest. These documents may...


Knowledge Extraction from Semi-structured Data Based on Fuzzy Techniques

In this work we propose a fuzzy technique to compare XML documents belonging to a semi-structured flow and sharing a common vocabulary of tags. Our approach is based on the idea of representing documents as fuzzy bags and, using a measure of comparison, evaluating structural similarities between them. Then we suggest how to organize the extracted knowledge in a class hierarchy, choosing a techn...


Linguistic Annotation for the Semantic Web

Establishing the semantic web on a large scale implies the widespread annotation of web documents with ontology-based knowledge markup. For this purpose, tools have been developed that allow for semi-automatic annotation of web documents with ontology-based metadata. However, given that a large number of web documents consist either fully or at least partially of free text, language technology ...


Ontology Based Framework for Web Page Information Extraction

The nature of Web information is dynamic and irregular, which makes it difficult to search and integrate information from the Web. The biggest task in making WWW data accessible to users/agents is extracting the data from Web pages. We take advantage of information in existing Web pages to create structured data semi-automatically. Extraction of information from semi-structured or unstructured d...


NLP-based Ontology Learning from Legal Texts. A Case Study

The paper reports on the methodology and preliminary results of a case study in automatically extracting ontological knowledge from Italian legislative texts in the environmental domain. We use a fully-implemented ontology learning system (T2K) that includes a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine language learning. Tools are dynamically i...


Publication date: 2009